As statisticians, data analysts, data scientists or any other practitioners who work with data, our job is all about getting insights from data. But where does this data come from? Knowing where a dataset comes from and where and how it was collected is crucial for us to determine if it will allow us to derive the insights we need.
Example: Ahmed is a consumer scientist working for Freshdale, a South African dairy company. His job is to determine whether Freshdale should invest more money into producing milk, cheese or yoghurt. For each of the datasets below, discuss if it is appropriate to answer this question.
Possible answers:
In the above example, the data sources were all examples of secondary data sources. The next sections will introduce primary, secondary and tertiary data sources, provide examples of these data sources, and discuss scenarios in which they are appropriate. The last section in this chapter will present examples of where to find open data online.
Primary data sources are collected first-hand for the purpose of a specific study. Examples of primary data sources include:
Can you think of other examples of primary data sources?
Remember: A primary data source is always created by the researcher themselves to answer a specific question or questions.
Primary data source pros and cons:
Pros:
Cons:
Secondary data sources are pre-collected by a different researcher for a purpose other than the current research question. In the Freshdale example, the surveys and datasets that Ahmed could choose from were collected by other researchers, in different years and different countries, to answer their own questions.
Consider the survey on dairy product consumption in the rural population of KwaZulu-Natal. Let us say that Ahmed’s colleague Amina created that dataset. For Amina, it was a primary data source, since she collected the data herself in order to answer a specific question. Ahmed did not collect that dataset, and would be using the dataset to answer a different question. Thus, it would be a secondary data source for Ahmed.
Examples of secondary data sources include:
Can you think of other examples of secondary data sources?
Remember: A secondary data source was created by another researcher to answer a different research question.
Discuss: If Ahmed conducts a survey on dairy consumption in South Africa in 2025, is it a primary or secondary data source? If Ahmed then uses this data in 2026 to analyse only milk consumption, is it a primary or secondary data source?
Secondary data source pros and cons:
Pros:
Cons:
Note: Secondary data is not guaranteed to be clean, accurate, reliable or ethical. As a researcher using the data, it is still your responsibility to check these aspects before using the data in your own research.
Tertiary data sources summarise and describe the information contained in primary and secondary data sources. They can provide a very useful starting point to study, understand and research a particular topic. However, they cannot be used to answer research questions in the same way as primary and secondary data sources.
Examples of tertiary data sources include:
Can you think of other tertiary data sources?
Remember: A tertiary data source is a summary of primary or secondary data sources.
In the Freshdale example, Ahmed might want to read a review paper discussing the latest research on dairy consumption.
The list below describes some sources of open international data.
The list below describes some sources of open African data.
As a statistician, data analyst, data scientist or a professional in any similar career, it is important to understand the different types of data. Intuitively, this makes sense: we understand that someone’s height is fundamentally different information from their blood pressure, favourite vegetable, or their name. We also understand that these different kinds of data cannot be directly compared.
As an example, consider two friends, Shamila and Kagiso. Shamila is 160 cm tall, has a systolic blood pressure of 110, and likes broccoli. Kagiso is 180 cm tall, has a systolic blood pressure of 119, and likes carrots. We can say that Kagiso is taller than Shamila, but we cannot say that Kagiso is taller than his blood pressure. We can say that Kagiso’s blood pressure is higher than Shamila’s, but we cannot say that Kagiso’s blood pressure is higher than broccoli (what would that even mean?). These differences may seem obvious in this example, but keep them in mind as we explore the different data types and scales of measurement.
Scales of measurement refer to the categories into which we divide data according to their properties. These categories inform the kinds of analyses we can appropriately perform on the data. There are four scales of measurement, namely nominal, ordinal, interval and ratio.
Consider the example dataset below. This dataset displays some administrative, demographic, economic and agricultural data of the provinces of South Africa.
Figure 1: Provinces dataset
Nominal data refers to data that has no order or ranking. In the Provinces dataset, there are three nominal variables. Province is a nominal variable, since there is no specific order to the provinces. Province where most of the population migrate is similarly a nominal variable.
Q: What is the third nominal variable in the Provinces dataset?
Nominal data is best represented by bar charts and pie charts.
Figure 2: Pie chart of the ‘Province where most of the population migrate’ nominal variable.
Figure 3: Bar chart of the ‘Province where most of the population migrate’ nominal variable.
Q: Would it be useful to make a pie chart or a bar chart of the Province variable? Why or why not?
Ordinal data refers to data that has an order, but where the difference in order cannot be measured. In the Provinces dataset, HDI (Human Development Index) is an ordinal variable. There is a clear order to the values of HDI, namely Medium or High. However, the difference in order cannot be measured quantitatively. We cannot, for instance, say that the Free State’s HDI is “twice as medium” as that of the Eastern Cape, or that North West’s HDI is “a third as high” as that of Western Cape.
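One way to make the idea of "order without measurable difference" concrete is to store the order of the categories explicitly. The sketch below is a minimal illustration in Python. The names `HDI_LEVELS` and `hdi_rank` are hypothetical, and the three observations are made-up placeholders rather than rows from the Provinces dataset.

```python
# Ordered categories for an ordinal variable, from lowest to highest.
# Assumption: only the two levels named in the text are used here.
HDI_LEVELS = ["Medium", "High"]


def hdi_rank(level: str) -> int:
    """Return the position of a level in the ordering (0 = lowest)."""
    return HDI_LEVELS.index(level)


# Hypothetical observations, not taken from the Provinces dataset.
observations = {"Province A": "High", "Province B": "Medium", "Province C": "High"}

# Ordering is meaningful: we can sort provinces from lowest to highest level.
sorted_provinces = sorted(observations, key=lambda p: hdi_rank(observations[p]))
print(sorted_provinces)

# Comparison is meaningful too.
print(hdi_rank("High") > hdi_rank("Medium"))

# The ranks only record position. Subtracting or dividing them would not
# tell us how much higher "High" is than "Medium".
```

The ranks are used only for sorting and comparing, which are the operations that ordinal data supports.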
Q: Can you think of other examples of ordinal variables?
Ordinal data is best represented by bar charts.
Figure 4: Bar chart of the ‘HDI’ ordinal variable.
Interval data refers to data that has an order, and where the difference between values can be measured, i.e. it can be quantified by a numerical value. However, the ratio between interval data values does not have a meaning, because there is no true zero. In the Provinces dataset, Summer temp is an interval variable. This is because temperature measured in degrees Celsius does not have a true zero value. A temperature of 0 degrees Celsius is cold; it does not mean that there is no temperature.
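A short calculation can show why ratios of interval data have no meaning. The sketch below converts two temperatures from degrees Celsius to degrees Fahrenheit and compares their differences and ratios. The two temperature values are hypothetical, and `celsius_to_fahrenheit` is an illustrative helper name.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# Hypothetical summer temperatures in degrees Celsius.
warm_c, cool_c = 30.0, 15.0
warm_f = celsius_to_fahrenheit(warm_c)
cool_f = celsius_to_fahrenheit(cool_c)

# Differences are meaningful: the gap between the two temperatures is a
# fixed amount of warming, whichever unit we report it in.
print(warm_c - cool_c, "degrees Celsius")
print(warm_f - cool_f, "degrees Fahrenheit")

# Ratios are not meaningful: the answer changes with the unit, because
# neither scale starts at a true zero.
ratio_c = warm_c / cool_c
ratio_f = warm_f / cool_f
print(ratio_c, ratio_f)
```

In Celsius the warmer value looks "twice" the cooler one, but in Fahrenheit the same two temperatures give a ratio of roughly 1.46. Since the physical temperatures have not changed, the ratio cannot be telling us anything real.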
Interval data can be represented by, among other things, bar charts, histograms, and box plots.
Figure 5: Box plot of the ‘Summer temp’ interval variable.
Q: Can you think of other examples of interval variables?
Ratio data refers to data that has an order, where the difference in order can be measured quantitatively, and where the ratio between values has a meaning. The data also has a true zero. In the Provinces dataset, Land area, Population density, and % of agricultural households are examples of ratio variables.
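In contrast with interval data, the ratio between two ratio-scale values does not depend on the unit of measurement. The sketch below checks this for land area, using hypothetical areas and an illustrative helper `km2_to_square_miles`.

```python
# One square mile in square kilometres (1.609344 km squared).
KM2_PER_SQUARE_MILE = 2.589988110336


def km2_to_square_miles(area_km2: float) -> float:
    """Convert an area from square kilometres to square miles."""
    return area_km2 / KM2_PER_SQUARE_MILE


# Hypothetical land areas, not taken from the Provinces dataset.
large_km2, small_km2 = 120_000.0, 30_000.0

ratio_km2 = large_km2 / small_km2
ratio_mi2 = km2_to_square_miles(large_km2) / km2_to_square_miles(small_km2)

# Both ratios say the same thing: the larger area is four times the smaller.
print(ratio_km2, ratio_mi2)

# A true zero survives the change of unit: no area is no area.
print(km2_to_square_miles(0.0))
```

Because zero means "none" in both units, statements such as "four times as large" remain true after conversion.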
Like interval data, ratio data can be represented by, among other things, bar charts, histograms, and box plots.
Figure 6: Histogram of the ‘% of households with no internet’ ratio variable.
Q: Are there other ratio variables in the Provinces dataset? If there are, which ones are they?
Give the scale of measurement of the following variables and explain your answer.
If you are struggling to identify the scale of measurement of a variable, ask yourself the following questions:
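The definitions given earlier in this chapter suggest three yes/no questions: does the variable have an order, can the differences between values be measured, and is there a true zero? The sketch below turns those questions into a small decision function. It is one possible way to organise the reasoning, and the function name `scale_of_measurement` and its parameter names are hypothetical.

```python
def scale_of_measurement(has_order: bool,
                         differences_measurable: bool,
                         has_true_zero: bool) -> str:
    """Suggest a scale of measurement from three yes/no answers."""
    if not has_order:
        return "nominal"
    if not differences_measurable:
        return "ordinal"
    if not has_true_zero:
        return "interval"
    return "ratio"


# Variables discussed in the text, with the answers the text gives for them.
print("Province:", scale_of_measurement(False, False, False))
print("HDI:", scale_of_measurement(True, False, False))
print("Summer temp:", scale_of_measurement(True, True, False))
print("Land area:", scale_of_measurement(True, True, True))
```

The questions are asked in order, so the function stops at the first "no". This mirrors how each scale adds one property to the one before it.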
Data comes in many different forms, such as numbers, text, images, GPS coordinates, and many more. In this course, the two main data types we will consider are quantitative and qualitative data. Quantitative data has numerical values, and can be analysed using mathematical and statistical methods. Qualitative data has descriptive or categorical values.
Quantitative data can be measured or counted, and is always expressed as numerical values. It is used to quantify concepts such as “how much”, “how many”, or “how often”. It can be analysed using mathematical operations such as addition and subtraction, as well as more advanced statistical methods.
In the Provinces dataset, Population size, % of agricultural households, Land area (sq km), Population density (per sq km), % of households with no internet access, Sex ratio, Median age and Summer temp are quantitative variables.
Q: Why is the variable Coastal (1) or Inland (2) not considered a quantitative variable, even though it is represented by a number?
Quantitative data is categorised into two main types: discrete and continuous.
Discrete data is data that can be counted, and thus has integer values. Discrete data does not have fractions or decimals. In the Provinces dataset, Population size is a discrete variable. This is because it represents the number of people in a province. The number of people can be counted, and will always have integer values. One cannot have ‘0.75 of a person’!
Typically, discrete data can be expressed as ‘the number of’ something.
Q: What other examples of discrete data can you think of?
Continuous data is data that can be measured, and thus has real values. It can have fractions and decimals. In the Provinces dataset, % of households with no internet access is a continuous variable. This is because it represents a percentage, which can have decimals.
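The difference between discrete and continuous values also shows up in how a programming language stores them. The sketch below is a minimal Python illustration with hypothetical household counts: the counts are whole numbers (`int`), while the percentage computed from them is a decimal number (`float`).

```python
# Hypothetical counts for one province, not taken from the Provinces dataset.
households_total = 1200        # discrete: "the number of" households
households_no_internet = 450   # discrete: also a count

# Continuous: a percentage can take decimal values.
pct_no_internet = households_no_internet / households_total * 100

print(type(households_total).__name__, households_total)
print(type(pct_no_internet).__name__, pct_no_internet)
```

Here the counts stay whole numbers, while the derived percentage is 37.5, a value that a count could never take.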
Q: What other examples of continuous data can you see in the Provinces dataset?
Qualitative data is used to categorise or describe phenomena. The scales of measurement associated with qualitative data are nominal and ordinal. Answers to open-ended questions are also examples of qualitative data.
In the Provinces dataset, Province, Coastal (1) or Inland (2), HDI and Province where most of the population migrate are qualitative variables.
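When categories are stored as number codes, it helps to translate the codes back into labels before analysing them. The sketch below does this for the Coastal (1) or Inland (2) coding. The list of codes is hypothetical, and `COASTAL_LABELS` is an illustrative name.

```python
from collections import Counter

# The coding scheme named in the Provinces dataset.
COASTAL_LABELS = {1: "Coastal", 2: "Inland"}

# Hypothetical codes for five provinces.
codes = [1, 2, 2, 1, 2]

# Translate codes into labels, then count how often each label occurs.
labels = [COASTAL_LABELS[code] for code in codes]
counts = Counter(labels)
print(counts["Coastal"], counts["Inland"])

# Python will happily average the codes, but the result (1.6) describes
# neither a coastal nor an inland province.
mean_code = sum(codes) / len(codes)
print(mean_code)
```

Counting labels is a sensible summary for qualitative data. The average of the codes is only an accident of how the categories were numbered.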
Other examples of qualitative data that you might encounter include:
Q: What other qualitative variables can you think of?
Qualitative data can best be visualised by bar charts, pie charts, and word clouds. The image below shows a word cloud of keywords associated with tourism in South Africa.
The table below summarises the key aspects of quantitative and qualitative data, and the key differences between them.
| Data type | Qualitative data | Quantitative data |
| --- | --- | --- |
| Nature | Descriptive, categorical | Numerical, measurable, countable |
| Values | Words, labels, categories | Numbers, counts, measurements |
| Examples | Province names, brands, opinions | Population size, percentages, income |
| Visualisation | Bar chart, pie chart, word cloud | Histogram, scatter plot, line chart |
| Mathematical operations | Not applicable | Applicable (e.g. sum, average) |
The data available in the world today is growing exponentially in volume and diversity. Social media, fitness apps, website cookies, videos, GPS devices, satellites and many more sources of data are producing thousands of gigabytes of data every day. Although this chapter focused on quantitative and qualitative data types, with nominal, ordinal, interval and ratio scales of measurement, many more data types exist in our rapidly changing world. It is thus important for you to take notice of some common additional data types, which this section will highlight.
Date and time data can be measured as ordinal, interval or ratio data, depending on its nature.
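The sketch below gives one hypothetical example of each case, using Python's standard `datetime` module. The labels, dates and durations are made up for illustration.

```python
from datetime import date, timedelta

# Ordinal: labels with an order, but no measurable distance between them.
PARTS_OF_DAY = ["Morning", "Afternoon", "Evening"]
morning_first = PARTS_OF_DAY.index("Morning") < PARTS_OF_DAY.index("Evening")
print(morning_first)

# Interval: calendar dates. Differences are meaningful, but the starting
# point of the calendar is a convention, not an absence of time.
survey_start = date(2025, 1, 15)
survey_end = date(2025, 3, 1)
elapsed = survey_end - survey_start
print(elapsed.days)

# Ratio: durations. A duration of zero means no time has passed, so it is
# meaningful to say one duration is three times another.
short_trip = timedelta(minutes=30)
long_trip = timedelta(minutes=90)
duration_ratio = long_trip / short_trip
print(duration_ratio)
```

Subtracting two dates gives a duration, which is why the same dataset can contain date and time values on more than one scale.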
Time series data represents measurements taken over time, usually at regular intervals. It is used to study and understand patterns that occur over time, and to predict what may happen in the future. Examples of time series data include:
Time series can often be represented by line plots. Figure 7 shows the line plots of average monthly maximum temperatures for each of the provinces.
Figure 7: Time series of monthly maximum temperature per province
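One simple way to see the pattern in a time series is to smooth it with a moving average, which replaces each value with the mean of a few neighbouring values. This technique is not part of the chapter itself, and the sketch below is only a minimal illustration. The monthly temperatures are hypothetical, and `moving_average` is an illustrative helper.

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Return the mean of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical average monthly maximum temperatures (degrees Celsius),
# for six consecutive months.
monthly_max_temp = [28.0, 27.0, 26.0, 23.0, 20.0, 17.0]

smoothed = moving_average(monthly_max_temp, window=3)
print(smoothed)
```

The smoothed series is shorter than the original because the first and last windows need a full set of three months.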
Image data is widely used in computer vision and AI. Images are typically represented as a set of three matrices, or grids, representing the red, green and blue values of each pixel in the image.
Figures 8 and 9 show an example of an image, and its decomposition into its red, green and blue components.
Figure 8: Example of image data (Photo by Andrew S on Unsplash)
Figure 9: An image split into its red, green and blue components
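The decomposition can be sketched with a tiny example. The code below represents a hypothetical 2 by 2 image as a grid of (red, green, blue) values between 0 and 255, and splits it into three separate grids. The function name `split_channels` is illustrative, and real images are usually handled with a dedicated library rather than nested lists.

```python
# A hypothetical 2 x 2 image. Each pixel is a (red, green, blue) triple,
# with values from 0 (none of that colour) to 255 (full intensity).
image = [
    [(255, 0, 0), (0, 255, 0)],        # a red pixel, a green pixel
    [(0, 0, 255), (255, 255, 255)],    # a blue pixel, a white pixel
]


def split_channels(pixels):
    """Split a grid of RGB triples into red, green and blue grids."""
    red = [[pixel[0] for pixel in row] for row in pixels]
    green = [[pixel[1] for pixel in row] for row in pixels]
    blue = [[pixel[2] for pixel in row] for row in pixels]
    return red, green, blue


red, green, blue = split_channels(image)
print(red)
print(green)
print(blue)
```

Each of the three grids has the same shape as the image. The white pixel appears with full intensity in all three.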
Images can also consist of more grids. Satellite images, for example, typically have additional grids with infrared values, water and vegetation indices, etc. These can be used to measure the presence of vegetation, water, and buildings, assess the health of vegetation, monitor climate change, and much more.
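As an illustration of how extra grids can be combined, the sketch below computes the Normalised Difference Vegetation Index (NDVI), a widely used vegetation index defined as (NIR - Red) / (NIR + Red), where NIR is the near-infrared value. The chapter does not name a specific index, so treating NDVI as the example is an assumption. The band values are hypothetical, and returning 0.0 when both bands are zero is a convention chosen here.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalised Difference Vegetation Index for one pixel."""
    total = nir + red
    if total == 0:
        # Assumption: report 0.0 where both bands are zero, to avoid
        # dividing by zero.
        return 0.0
    return (nir - red) / total


# Hypothetical 2 x 2 grids of reflectance values between 0 and 1.
red_band = [[0.1, 0.3], [0.2, 0.4]]
nir_band = [[0.5, 0.3], [0.6, 0.4]]

ndvi_grid = [
    [ndvi(nir, red) for nir, red in zip(nir_row, red_row)]
    for nir_row, red_row in zip(nir_band, red_band)
]
print(ndvi_grid)
```

Higher values typically indicate denser or healthier vegetation, while values near zero indicate little vegetation.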
Spatial data is concerned with the locations of phenomena. Spatial data can include satellite images, GPS locations, GPS routes, and more. Figure 10 shows an example of a satellite image of central Pretoria, as well as GPS locations of some points of interest, and lines representing the roads. These are all examples of spatial data.
Figure 10: Top left: A satellite image; Top right: points of interest; Bottom left: roads; Bottom right: All previous datasets overlaid
Spatial data is used in urban planning, ecology, epidemiology (the study of disease spread), and more.
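A common first calculation with GPS locations is the distance between two points. The sketch below uses the haversine formula, which approximates the Earth as a sphere. The coordinates are illustrative values in the general area of Pretoria, not the points shown in Figure 10, and `haversine_km` is an illustrative helper.

```python
import math

# Mean radius of the Earth in kilometres (a spherical approximation).
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative (latitude, longitude) pairs in decimal degrees.
point_a = (-25.75, 28.19)
point_b = (-25.75, 28.29)

distance = haversine_km(*point_a, *point_b)
print(round(distance, 2), "km")
```

Note that latitude and longitude are only meaningful as a pair: together they identify a single location.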
Audio data can be found in sound files. This can include, among other things, music, audio tracks for movies, and voice recordings (like voice notes). Audio data is represented by a time series signal where the amplitude of the sound wave is sampled at regular intervals. Figure 11 shows an example of an audio file.
Figure 11: An audio file represented as a time series of amplitudes. Higher amplitudes represent a louder volume.
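The idea of sampling a sound wave at regular intervals can be sketched by generating a pure tone. The code below is a minimal illustration. The frequency, sample rate and duration are hypothetical choices, and `sample_sine` is an illustrative helper. Real audio files are usually read with a dedicated library.

```python
import math


def sample_sine(frequency_hz: float, sample_rate_hz: int,
                duration_s: float, amplitude: float = 1.0) -> list[float]:
    """Sample a sine wave at regular intervals and return the amplitudes."""
    n_samples = round(sample_rate_hz * duration_s)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


# A 440 Hz tone sampled 8000 times per second for one hundredth of a second.
loud = sample_sine(frequency_hz=440.0, sample_rate_hz=8000, duration_s=0.01)

# The same tone at half the amplitude, i.e. a quieter sound.
quiet = sample_sine(frequency_hz=440.0, sample_rate_hz=8000, duration_s=0.01,
                    amplitude=0.5)

print(len(loud))
print(max(abs(s) for s in loud) > max(abs(s) for s in quiet))
```

Each list is a time series: the position in the list corresponds to the time of the sample, and the value is the amplitude at that time.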
Video data is a series of images shown one after another. Videos without sound are examples of video data. Videos with sound, such as movies downloaded from the internet, or videos taken with your phone, are a combination of video data (the visuals) and audio data (the sound).
The last part of this chapter considers how data of different types are stored on computers. We will look at the possible values that the data can be stored as, and common file extensions for each data type.
Qualitative data is stored differently depending on whether it is nominal, ordinal, or text data (e.g. a response to an open-ended survey question).
Qualitative data can be stored in text files (file extension: .txt) or in spreadsheets (file extensions: .csv, .xlsx).
Quantitative data is usually stored in spreadsheets (file extensions: .csv, .xlsx).
| Data type | Common file extensions |
| --- | --- |
| Time series | .csv, .xlsx |
| Image data | .png, .jpg, .bmp, .tiff |
| Spatial data | .shp, .json, .geojson |
| Audio data | .wav, .mp3 |
| Video data | .mkv, .avi, .mp4 |
The previous sections explained how data of various types and scales of measurement is typically stored. However, the same data can be stored in different ways, and it can happen that data is stored incorrectly, or in a way that is unsuitable for use.
Exercise: Type a date in Excel, e.g. “10-10”. When you press Enter, it should automatically correct to a date (10 October). See what happens when you change the cell’s format to Text or Number. Now redo the exercise by first setting the format of the cell to text, and then typing in “10-10”. What happens now?
Key takeaway: The same data can be stored on a computer in a variety of ways. It is up to you, as the analyst, to understand how the data should be stored for your analysis.
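The same issue appears when data is read from a file with code. The sketch below reads a small, hypothetical CSV file with Python's standard `csv` module, which returns every value as text. The analyst then decides how each value should be interpreted. The column names, the values, and the year used to build the date are all assumptions for illustration.

```python
import csv
import io
from datetime import date

# A hypothetical CSV file, held in memory so the example is self-contained.
raw_csv = "province,population,entry\nProvince A,1200,10-10\n"

rows = list(csv.DictReader(io.StringIO(raw_csv)))
row = rows[0]

# Every value arrives as text, including the population count.
print(type(row["population"]).__name__, repr(row["population"]))

# Interpretation 1: convert the count to a whole number for analysis.
population = int(row["population"])

# Interpretation 2: keep "10-10" as plain text.
as_text = row["entry"]

# Interpretation 3: treat "10-10" as day-month, as a spreadsheet might.
# Assumption: the year 2025 is supplied here only to build a full date.
day, month = (int(part) for part in row["entry"].split("-"))
as_date = date(2025, month, day)

print(population, as_text, as_date.isoformat())
```

None of the three interpretations is wrong in itself. Which one is appropriate depends on what the value was meant to record.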
Give the data type (qualitative, quantitative discrete, quantitative continuous, or other) of the following variables and explain your answer.
If you are struggling to identify the data type of a variable, ask yourself the following questions:
Bonus question: Consider the variables Longitude and Latitude in the Provinces dataset. What kind of data do you think these variables are? What if you consider them together?